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In this paper we describe two bootstrap methods for massive data 
sets. Naive applications of common resampling methodology are of- 
ten impractical for massive data sets due to computational burden 
and due to complex patterns of inhomogeneity. In contrast, the pro- 
posed methods exploit certain structural properties of a large class of 
massive data sets to break up the original problem into a set of sim- 
pler subproblems, solve each subproblem separately where the data 
exhibit approximate uniformity and where computational complexity 
can be reduced to a manageable level, and then combine the results 
through certain analytical considerations. The validity of the pro- 
posed methods is proved and their finite sample properties are stud- 
ied through a moderately large simulation study. The methodology is 
illustrated with a real data example from Transportation Engineer- 
ing, which motivated the development of the proposed methods. 

1. Introduction. Statistical analysis and inference for massive data sets 
present unique challenges. Naive applications of standard statistical method- 
ology often become impractical, especially due to increase in computational 
complexity. While large data size is desirable from a statistical inference per- 
spective, suitable modification of existing statistical methodology is needed 
to handle such challenges associated with massive data sets. In this paper, 
we propose a novel resampling methodology, called the Gap Bootstrap, for 
a large class of massive data sets that possess certain structural properties. 
The proposed methodology cleverly exploits the data structure to break 
up the original inference problem into smaller parts, use standard resam- 
pling methodology to each part to reduce the computational complexity. 
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and then use some analytical considerations to put the individual pieces 
together, thereby alleviating the computational issues associated with large 
data sets to a great extent. 

The class of problems we consider here is the estimation of standard er- 
rors of estimators of population parameters based on massive multivariate 
data sets that may have heterogeneous distributions. A primary example is 
the origin-destination (OD) model in transportation engineering. In an OD 
model, which motivates this work and which is described in detail in Sec- 
tion 2 below, the data represent traffic volumes at a number of origins and 
destinations collected over short intervals of time (e.g., 5 minute intervals) 
daily, over a long period (several months), thereby leading to a massive data 
set. Here, the main goals of statistical analysis are (i) uncertainty quantifi- 
cation associated with the estimation of the parameters in the OD model 
and (ii) to improve prediction of traffic volumes at the origins and the desti- 
nations over a given stretch of the highway. Other examples of massive data 
sets having the required structural property include (i) receptor modeling 
in environmental monitoring, where spatio-temporal data are collected for 
many pollution receptors over a long time, and (ii) toxicological models for 
dietary intakes and drugs, where blood levels of a large number of toxins and 
organic compounds are monitored in repeated samples for a large number of 
patients. The key feature of these data sets is the presence of "gaps" which 
allow one to partition the original data set into smaller subsets with nice 
properties. 

The "largeness" and potential inhomogeneity of such data sets present 
challenges for estimated model uncertainty evaluation. The standard prop- 
agation of error formula or the delta method relies on assumptions of in- 
dependence and identical distributions, stationarity (for space-time data) 
or other kinds of uniformity which, in most instances, are not appropriate 
for such data sets. Alternatively, one may try to apply the bootstrap and 
other resampling methods to assess the uncertainty. It is known that the 
ordinary bootstrap method typically underestimates the standard error for 
parameters when the data are dependent (positively correlated). The block 
bootstrap has become a popular tool for dealing with dependent data. By 
using blocks, the local dependence structure in the data is maintained and, 
hence, the resulting estimates from the block bootstrap tend to be less bi- 
ased than those from the traditional (i.i.d.) bootstrap. For more details, 
see Lahiri (1999, 2003). However, computational complexity of naive block 
bootstrap methods increases significantly with the size of the data sets, as 
the given estimator has to be computed repeatedly based on resamples that 
have the same size as the original data set. In this paper, we propose two re- 
sampling methods, generally both referred to as Gap Bootstraps, that exploit 
the "gap" in the dependence structure of such large-scale data sets to reduce 
the computational burden. Specifically, the gap bootstrap estimator of the 
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standard error is appropriate for data that can be partitioned into approxi- 
mately exchangeable or homogeneous subsets. While the distribution of the 
entire data set is not exchangeable or homogeneous, it is entirely reasonable 
that many multivariate subsets will be exchangeable or homogeneous. If the 
estimation method that is being used is accurate, then we show that the 
gap bootstrap gives a consistent and asymptotically unbiased estimate of 
standard errors. The key idea is to employ the bootstrap method to each of 
the homogeneous subsets of the data separately and then combine the esti- 
mators from different subsets in a suitable way to produce a valid estimator 
of the standard error of a given estimator based on the entire data set. The 
proposed method is computationally much simpler than the existing resam- 
pling methods that require repeated computation of the original estimator, 
which may not be feasible simply due to computational complexity of the 
original estimator, at the scale of the whole data set. 

The rest of the paper is organized as follows. In Section 2 we describe the 
OD model and the data structure that motivate the proposed methodology. 
In Section 3 we give the descriptions of two variants of the Gap Bootstrap. 
Section 4 asserts consistency of the proposed Gap Bootstrap variance esti- 
mators. In Section 5 we report results from a moderately large simulation 
study, which shows that the proposed methods attain high levels of accuracy 
for moderately large data sets under various types of gap-dependence struc- 
tures. In Section 6 we revisit the OD models and apply the methodology to 
a real data set from a study of traffic patterns, conducted by an intelligent 
traffic management system on a test bed in San Antonio, TX. Some con- 
cluding remarks are made in Section 7. Conditions for the validity of the 
theoretical results and outlines of the proofs are given in the Appendix. 

2. The OD models and the estimation problem. 

2.1. Background. The key component of an origin-destination (OD) mod- 
el is an OD trip matrix that reflects the volume of traffic (number of trips, 
amount of freight, etc.) between all possible origins and destinations in a 
transportation network over a given time interval. The OD matrix can be 
measured directly, albeit with much effort and at great costs, by conducting 
individual interviews, license plate surveys, or by taking aerial photographs 
[cf. Cramer and Keller (1987)]. Because of the cost involved in collecting di- 
rect measurements to populate a traffic matrix, there has been considerable 
effort in recent years to develop synthetic techniques which provide "rea- 
sonable" values for the unknown OD matrix entries in a more indirect way, 
such as using observed data from link volume counts from inductive loop 
detectors. Over the past two decades, numerous approaches to synthetic 
OD matrix estimation have been proposed [Cascetta (1984), Bell (1991), 
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Fig. 1. The transportation network in San Antonio, TX under study. 



Okutani (1987), Dixon and Rilett (2000)]. One common approach for esti- 
mating the OD matrix from hnk volume counts is based on the least squares 
regression where the unknown OD matrix is estimated by minimizing the 
squared Euclidean distance between the observed link and the estimated 
link volumes. 

2.2. Data structure. The data are in the form of a time series of link 
volume counts measured at several on/off ramp locations on a freeway using 
an inductive loop detector, such as in Figure 1. 

Here Ok and , respectively, represent the traffic volumes at the fcth ori- 
gin and the fcth destination over a given stretch of a highway. The analysis 
period is divided into T time periods of equal duration At. The time series 
of link volume counts is generally periodic and weakly dependent, that is, the 
dependence dies off as the separation of the time intervals becomes large. For 
example, daily data over each given time slot of duration At are similar, but 
data over well separated time slots (e.g., time slots in Monday morning and 
Monday afternoon) can be different. This implies that the traffic data have 
a periodic structure. Further, Monday at 8:00-8:05 am data have nontrivial 
correlation with Monday at 8:05-8:10 am data, but neither data set says 
anything about Tuesday data at 8:00-8:05 am (showing approximate inde- 
pendence). Accordingly, let Yt,t = 1,2..., be a d-dimensional time series, 
representing the link volume counts at a given set of on/off ramp locations 
over the tth time interval. Suppose that we are interested in reconstructing 
the OD matrix for p-many short intervals during the morning rush hours, 
such as 36 link volume counts over At = 5-minute intervals, extending from 
8:00 am through 11:00 am, at several on/off ramp locations. Thus, the ob- 
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served data for the OD modeling is a part of the series, 



{Xi, . . . , Xp; . . . ; X/ 



'-(m— l)p+l )•••)- 

where the hnk volume counts are observed over the p-intervals on each day, 
for m days, giving a d-dimensional multivariate sample of size n = mp. There 
are q = T — p time slots between the last observation on any given day and 
the first observation on the next day, which introduces the "gap" structure 
in the X^-series. Specifically, in terms of the Yf-series, the X^-variables are 
given by 



X 



ip+j 



1, 



,p,i = 0, 



, m 



1. 



■ «(p+<7)+i' 

For data collected over a large transportation network and over a long period 
of time, d and m are large, leading to a massive data set. Observe that the 
Xt-variables can be arranged in a p x m matrix, where each element of the 
matrix-array gives a d-dimensional data value: 

/ Xi Xp_|_i 



(2.1) 



X(m-l)p+l \ 



X. X, 



p+2 



X 



(m-l)p+2 



\ X2p . . . Xmp / 

Due to the arrangement of the p time slots in the jth day along the jth 
column in (2.1), the rows in the array (2.1) correspond to a fixed time slot 
over days and are expected to exhibit a similar distribution of the link volume 
counts; although a day-of-week variation might be present, the standard 
practice in the Transportation engineering is to treat the weekdays as similar 
[cf. Roess, Prassas and McShane (2004), Mannering, Washburn and Kilareski 
(2009)]. On the other hand, due to the "gap" between the last time slot on 
the jth day and the first time slot of the (j + l)st day, the variables in the 
jth and the (j + l)st columns are essentially independent. Hence, this yields 
a data structure where 

(a) the variables within each column have serial correlations 

and possibly nonstationary distributions, 
(2.2) (b) the variables in each row are identically distributed, and > 
(c) the columns are approximately independent arrays 

of random vectors. 

In the transportation engineering application, each random vector Xt rep- 
resents the link volume counts in a transportation network corresponding to 
r origin (entrance) ramps and s destination (exit) ramps as shown in Fig- 
ure 1. Let oet and dkt, respectively, denote the link volumes at origin £ and 
at destination k at time t. Then the components of X^ for each t are given 
by the d = (r + s)-variables {ou :£= 1, . . . ,r}[J {d^t :k = l,. . . ,s}. Given the 
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link volume counts on all origin and destination ramps, the fraction pki 
(known as the OD split proportion) of vehicles that exit the system at des- 
tination ramp k given that they entered at origin ramp t can be calculated. 
This is because the link volume at destination k at time t, dkt, is a linear 
combination of the OD split proportions and the origin volumes at time t, 
oit's. In the synthetic OD model, Pke^s are the unknown system parameters 
and have to be estimated. Once the split proportions are available, the OD 
matrix for each time period can be identified as a linear combination of the 
split proportion matrix and the vector of origin volumes. The key statisti- 
cal inference issue here is to quantify the size of the standard errors of the 
estimated split proportions in the synthetic OD model. 

3. Resampling methodology. 

3.1. Basic framework. To describe the resampling methodology, we adopt 
a framework that mimics the "gap structure" of the OD model in Sec- 
tion 2. Let {Xi, . . . , Xp; . . . ; X(m_i)p_|_i, . . . , X^p} be a d-dimensional time 
series with stationary components {Xjp+j : i = 0, . . . , m — 1} for j = 1, . . . ,p 
such that the corresponding array (2.1) satisfies (2.2). For example, such a 
time series results from a periodic, multivariate parent time series Yj that is 
TTT-o-dependent for some tuq > and that is observed with "gaps" of length 
q > rriQ. In general, the dependence structure of the original time series 

is retained within each complete period {Xjp+j ■.j = l,...,p}, i = 0, . . . ,m, 
but the random variables belonging to two different periods are essentially 
independent. Let be a vector- valued parameter of interest and let On be an 
estimator of 9 based on Xi, . . . X„, where n = mp denotes the sample size. 
We now formulate two resampling methods for estimating the standard er- 
ror of 9n that are suitable for massive data sets with such "gap" structures. 
The first method is applicable when the p rows of the array (2.1) are ex- 
changeable and the second one is applicable where the rows are possibly 
nonidentically distributed and where the variables within each column have 
serial dependence. 

3.2. Gap Bootstrap I. Let X(-j) = (Xjp+j : i = 0, . . . , m — 1) denote the jth 
row of the array X in (2.1). For the time being, assume that the rows of X 
are exchangeable, that is, for any permutation (ji,...,jp) of the integers 
(1, . . . ,p), {X(jj), . . . ,X(jp)} have the same joint distribution as {X(i), . . . , 
X(p)}, although we do not need the full force of exchangeability for the 
validity of the method (cf. Section 4). For notational compactness, set X(o) = 
X. Next suppose that the parameter 9 can be estimated by using the row 
variables X(j) as well as using the complete data set, through estimating 
equations of the form 

^',(X(,);e) = 0, j=0,l,...,p, 
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resulting in the estimators 9jn, based on the jth row, for j = 1, . . . ,p, and 
the estimator On = 9on for j = based on the entire data set, respectively. 
It is obvious that for large values of p, the computation of OjnS can be 
much simpler than that of 9n, as the estimators 9jnS are based on a fraction 
(namely, ^) of the total observations. On the other hand, the individual 

9jnS lose efficiency, as they are based on a subset of the data. However, 
under some mild conditions on the score functions, the M-estimators can 
be asymptotically linearized by using the averages of the influence functions 
over the respective data sets X(j) [cf. Chapter 7, Serfling (1980)]. As a result, 
under such regularity conditions, 

p 

(3.1) 9n=P~^^0,n 

i=i 

gives an asymptotically equivalent approximation to 9n- Now an estimator 
of the variance of the original estimator 9n can be obtained by combining 
the variance estimators of the Oj^s through the equation 

p 

^Var(%„)+ Cov(^ 

.j=l l<j^k<p 

Note that using the i.i.d. assumption on the row variables, an estimator of 
Var(0j„) can be found by the ordinary bootstrap method (also referred to as 
the i.i.d. bootstrap in here) of Efron (1979) that selects a with replacement 
sample of size m from the jth row of data values. We denote this by Var(^j„) 
(and also by Sjn); i = 1, • • • ,P- Further, under the exchangeability assump- 
tion, all the covariance terms are equal and, hence, we may estimate the 
cross-covariance terms by estimating the variance of the pairwise differences 
as follows: 

a \ ^l<jT^k<pi^jn- 9kn){9jn- 9kn)' 1/ • /, ^ 

Yw{9j^^n-9kQn) = _ , l<Jo/A;o<p. 

Then, the cross covariance estimator is given by 

Cov(0jo„, ^fcon) = \^j(,n + Sfcon — ^&'^{9jon — 6'feon)]/2. 

Plugging in these estimators of the variance and the covariance terms in 

(3.2) yields the Gap Bootstrap Method I estimator of the variance of 9n as 



(3.2) Var(^„)=p 



-2 



(3.3) VarGB-i(6'n) =P" 



p 



YaT{9jn)+ > Cov{9jn,9kr. 



.j=l l<j^k<p 



Note that the estimator proposed here only requires computation of the 
parameter estimators based on the p subsets, which can cut down on the 
computational complexity significantly when p is large. 
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3.3. Gap Bootstrap II. In this section we describe a Gap Bootstrap 
method for the more general case where the rows X(j)'s in (2.1) are not 
necessarily exchangeable and, hence, do not have the same distribution. 
Further, we allow the columns of X to have certain serial dependence. This, 
for example, is the situation when the X^-series is obtained from a weakly 
dependent parent series {Y^} by systematic deletion of (^-components, creat- 
ing the "gap" structure in the observed X^-series as described in Section 2. 
If the Yf-series is mo-dependent with an mo < q, then {X^} satisfies the con- 
ditions in (2.2). For a mixing sequence Y^, the gapped segments are never 
exactly independent, but the effect of the dependence on the gapped seg- 
ments are practically negligible for large enough "gaps," so that approximate 
independence of the columns holds when q is large. We restrict attention to 
the simplified structure (2.2) to motivate the main ideas and to keep the 
exposition simple. Validity of the theoretical results continue to hold under 
weak dependence among the columns of the array (2.1); see Section 4 for 
further details. 

As in the case of Gap Bootstrap I, we suppose that the parameter 6 can be 
estimated by using the row variables X(j) as well as using the complete data 

set, resulting in the estimator 6jn, based on the jth. row for j = 1, . . . ,p and 
the estimator 9n = 6on (for j = 0) based on the entire data set, respectively. 
The estimation method can be any standard method, including those based 
on score functions and quasi-maximum likelihood methods, such that the 
following asymptotic linearity condition holds: 

There exist known weights win, ■ ■ ■ ,Wpn £ [0,1] with X]j=i^jn ~ ^ such 
that 

p 

(3.4) 9n — WjnOjn = op{n~^^'^) as n — >■ OO. 

i=i 

Classes of such estimators are given by (i) L-, M- and R-estimators of 
location parameters [cf. Koul and Mukherjee (1993)], (ii) differentiable func- 
tional of the (weighted) empirical process [cf. Serfling (1980), Koul (2002)], 
and (iii) estimators satisfying the smooth function model [cf. Hall (1992), 
Lahiri (2003)]. An explicit example of an estimator satisfying (3.4) is given 
in Remark 3.5 below [cf. (3.9)] and the details of verification of (3.4) are 
given in the Appendix. 

Note that under (3.4), the asymptotic variance of n^/'^{6n — d) is given 
by the asymptotic variance of ^^^iWjnn^^'^{Ojn — G). The latter involves 
both variances and covariances of the row-wise estimators 9jnS. The Gap 
Bootstrap method II estimator of the variance of On is obtained by com- 
bining individual variance estimators of the marginal estimators 9jnS with 
estimators of their cross covariances. Note that as the row-wise estimators 
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6jn are based on (approximately) i.i.d. data, as in the case of Gap Bootstrap 
method I, one can use the i.i.d. bootstrap method of Efron (1979) within 
each row ^(j) and obtain an estimator of the standard error of each 9jn- We 

continue to denote these by Var(^j„), 1 < j <p, as in Section 3.2. However, 
since we now allow the presence of temporal dependence among the rows, 
resampling individual observations is not enough [cf. Singh (1981)] for cross- 
covariance estimation and some version of block resampling is needed [cf. 
Kiinsch (1989), Lahiri (2003)]. As explained earlier, repeated computation 
of the estimator 9n based on replicates of the full sample may not be feasible 
merely due to the associated computational costs. Instead, computation of 
the replicates on smaller portions of the data may be much faster (as it avoids 
repeated resampling) and stable. This motivates us to consider the sampling 
window method of Politis and Romano (1994) and Hall and Jing (1996) for 
cross-covariance estimation. Compared to the block bootstrap methods, the 
sampling window method is computationally much faster but at the same 
time, it typically achieves the same level of accuracy as the block bootstrap 
covariance estimators, asymptotically [cf. Lahiri (2003)]. The main steps of 
the Gap Bootstrap Method H are as follows. 

3.3.1. The univariate parameter case. For simplicity, we first describe 
the steps of the Gap Bootstrap Method II for the case where the parameter 
9 is one- dimensional: 

Steps: 

(I) Use i.i.d. resampling of individual observations within each row to 
construct a bootstrap estimator Var(6'jn) of Var(0j>t), j = 1, . . . ,p, as 
in the case of Gap Bootstrap method I. In the one-dimensional case, 
we will denote these by (t|^, j = 1, . . . ,p. 

(II) The Gap Bootstrap II estimator of the asymptotic variance of 9n is 
given by 

p p 

(3.5) = WjnWkn^jn^knPn {j, k) , 

j=l k=l 

where is as in Step I and where pn{j,k) is the sampling window 

estimator of the asymptotic correlation between 9jn and 9kn, described 
below. 

(HI) To estimate the correlation Pnij,k) between 9jn and 9kn by the sam- 
pling window method [cf. Politis and Romano (1994) and Hall and 
Jing (1996)], first fix an integer i G (l,m). Also, let 

X*^"^^ = (Xi, . . . ,Xp), X*^^^ = (Xp+i, . . . ,X2p), . . . , 

X*^*"^ = (X(^_i)p+i, . . . , Xmp) 
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denote the columns of the matrix array (2.1). The version of the sam- 
phng window method that we wih employ here will be based on (over- 
lapping) subseries of (. columns. The following are the main steps of 
the sampling window method: 

(Ilia) Define the overlapping subseries of the column-variables X^'^ of 
length I as 



;fi = (x»,...,x(^+^-^ 



1,...,/, 



where I = m — l + l. Note that each subseries Xi contains I com- 
plete columns or periods and consists of ^p-many X^-variables. 
(Illb) Next, for each i = 1, . . . ,/, we employ the given estimation al- 
gorithm to the Xj-variables in Xi to construct the subseries 
version 6'j2 of 9jn, j = 1; • • • (There is a slight abuse of nota- 
tion here, as the sample size for the ith subseries of X^-variables 
is Ip, not n = mp and, hence, we should be using ^^(^p) instead 

of ^^2' but we drop the more elaborate notation for simplicity). 
(IIIc) For i < j < k < p, the sampling window estimator of the corre- 
lation between 6jn and 6kn is given by 



(3.6) 



9(0 

jn 



9n) 



jn 



\2]l/2 



3.3.2. The multivariate parameter case. The multivariate version of the 
Gap bootstrap estimator of the variance matrix of a vector parameter es- 
timator On can be derived using the same arguments, with routine changes 
in the notation. Let Tijn denote the bootstrap estimator of Var(^j.„), based 
on the i.i.d. bootstrap method of Efron (1979). Next, with the subsampling 
replicates ^j^, j = I, . . . ,p, based on the overlapping blocks {Xi :i = 1, . . . , 1} 
of £ columns each (cf. Step [III] of Section 3.3.1), define the sampling window 
estimator TZnUi ^) °f the correlation matrix of 9jn and Okn as 



1=1 



-1/2 



^-^E(^2-^n)(e-^n)' 
i=l 

I 



i=l 



-1/2 
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Then the variance estimator based on Gap bootstrap II is given by 

p p 

(3.7) VarGB-iilf^n) = ^^tL'j„WfenS]{^:^„(i, k)tl^^. 

j=i k=i 

3.3.3. Some comments on Method II. 

Remark 3.1. Note that for estimators {Ojn'-j = ^,---,p} with large 
asymptotic variances, estimation of the correlation coefficients by the sam- 
pling window method is more stable, as these are hounded (and have a 
compact support). On the other hand, the asymptotic variances of OjnS 
have an unbounded range of values and therefore are more difficult to esti- 
mate accurately. Since variance estimation by Efron (1979)'s bootstrap has a 
higher level of accuracy [e.g., Op(n~^/^)] compared to the sampling window 
method variance estimation [with the slower rate Op([^/n]^/^ + see 
Lahiri (2003)], the proposed approach is expected to lead to a better overall 
performance than a direct application of the sampling window method to 
estimate the variance of On- 

Remark 3.2. Note that all estimators computed here (apart from a 
one-time computation of On in the sampling window method) are based on 
subsamples and hence are computationally simpler than repeated computa- 
tion of On required by naive applications of the block resampling methods. 

Remark 3.3. For applying Gap Bootstrap II, the user needs to specify 
the block length /. Several standard block length selection rules are available 
in the block resampling literature [cf. Chapter 7, Lahiri (2003)] for estimat- 
ing the variance-covariance parameters. Any of these are applicable in our 
problem. Specifically, we mention the plug-in method of Patton, Politis and 
White (2009) that is computationally simple and, hence, is specially suited 
for large data sets. 

Remark 3.4. The proposed estimator remains valid (i.e., consistent) 
under more general conditions than (2.2), where the columns of the array 
(2.1) are not necessarily independent. In particular, the proposed estimator 
in (3.7) remains consistent even when the variables in the array (2.1) are 
obtained by creating "gaps" in a weakly dependent (e.g., strongly mixing) 
parent time series Y^. This is because the subsampling window method 
employed in the construction of the cross-correlation can effectively capture 
the residual dependence structure among the columns of the array (2.1). The 
use of i.i.d. bootstrap to construct the variance estimators ^jn is adequate 
when the gap is large, as the separation of two consecutive random variables 
within a row makes the correlation negligible. See Theorem 4.2 below and 
its proof in the Appendix. 
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Remark 3.5. An alternative, intuitive approach to estimating the vari- 
ance of On is to consider the data array (2.1) by columns rather than by rows. 
Let 9^^\ . . . ,9^"^^ denote the estimates of 9 based on the m columns of the 
data matrix X. Then, assuming that the columns of X are (approximately) 
independent and assuming that are identically distributed, one 

may be tempted to estimate Var(^„) by using the sample variance of the 
9^^\ . . . ,9^'^\ based on the following analog of (3.1): 

m 

(3.8) On^m^^Y,^^''^- 

k=l 

However, when p is small compared to m, such an approximation is sub- 
optimal, and this approach may drastically fail if p is fixed. As an illustrating 
example, consider the case where the Xj's are 1-dimensional random vari- 
ables, p>l is fixed (i.e., it does not depend on the sample size), n = mp, 
and the columns 'K^''\k = l,...,m, have an "identical distribution" with 
mean vector (/x, . . . ,/x)' G and p x p covariance matrix S. For simplicity, 
also suppose that the diagonal elements of S are all equal to cr^ € (0,oo). 
Let 

n 
i=l 

an estimator of 9 = p~^ trace(S) = cx^. Let 9^^^ and Ojn, respectively, denote 
the sample variance of the X^'s in the A;th column and the jth row, k = 
1, . . . , m and j = 1, . . . ,p. Then, in Appendix A.l, we show that 

p 

(3.9) 9n = p~^Y9jn + Op{n~^/^), 

i=i 

while 

m 

(3.10) 9n = m-^Y,^^''^ +p-'^l'm + Op{n-^/^), 

k=l 

where 1 is the p x 1 vector of I's. Thus, in this example, (3.4) holds with 
Wjn = p~^ for 1 < j < p. However, (3.10) shows that the column-wise ap- 
proach based on (3.8) results in a very crude approximation which fails to 
satisfy an analog of (3.4). For estimating the variance of 9n, the deterministic 
term has no effect, but the Op(n~^/^)-term in (3.10) has a nontriv- 

ial contribution to the bias of the resulting column-based variance estimator, 
which can not be made negligible. As a result, this alternative approach fails 
to produce a consistent estimator for fixed p. In general, caution must be 
exercised while applying the column-wise method for small p. 
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4. Theoretical results. 

4.1. Consistency of Gap Bootstrap I estimator. The Gap Bootstrap I es- 
timator VarGP-i(^n) of the (asymptotic) variance matrix of 9n is consistent 
under fairly mild conditions, as stated in Appendix A. 2. Briefly, these con- 
ditions require (i) homogeneity of pairwise distributions of the centered and 
scaled estimators {m^^^{9jn — ^) : 1 < j < p}, (ii) some moment and weak 
dependence conditions on the m^^^{6jn — 0)'s, and (iii) p oo as n — > oo. 
In particular, the rows of X need not be exchangeable. Condition (iii) is 
needed to ensure consistency of the estimator of the covariance term(s) in 
(3.3), which is defined in terms of the average of the p{p — 1) pair-wise dif- 
ferences {9jn — Okn : 1 < J 7^ ^ < p}- Thus, for employing the Gap Bootstrap 
I method in an application, p{p — 1) should not be too small. 

The following result asserts consistency of the Gap Bootstrap I variance 
(matrix) estimator. 

Theorem 4.1. Under conditions (A.l) and (A. 2) given in the Appendix, 
as oo, 

7T.[VarGB-l(^n) — Var(^„)] in probability. 

4.2. Consistency of Gap Bootstrap II estimator. Next consider the Gap 
Bootstrap II estimator of the (asymptotic) variance matrix of On- Consis- 
tency of VarGB-ii(^n) holds here under suitable regularity conditions on the 
estimators {Ojn : 1 < i < p} and the length of the "gap" q for a large class 
of time series that allows the rows of the array (2.1) to have nonidentical 
distributions. See the Appendix for details of the conditions and their im- 
plications. It is worth noting that unlike Gap Bootstrap I, here the column 
dimension p need not go to infinity for consistency. 

Theorem 4.2. Under conditions (C.1)-(C.4), given in the Appendix, 
as oo, 

n[VarGB-H(^n) — Var(^„)] — )■ in probability. 

5. Simulation results. To investigate finite sample properties of the pro- 
posed methods, we conducted a moderately large simulation study involving 
different univariate and multivariate time series models. For the univariate 
case, we considered three models: 

(I) Autoregressive (AR) models of order two (Xt = fi + Yt where Yj = 
aiYt_i + a2Yt_2 + Wt). 

(II) Moving average (MA) models of order two {Xt = ^ + Yt where Yj = 
^iWt-i+f32Wt-2 + Wt). 

(Ill) A periodic time series model {Xt = fit + Wt, Wt = o'Et), 
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where Wt = cret and {et} are i.i.d. random variables with zero mean and unit 
variance. The parameter values of the AR models are ai = 0.8, 02 = 0.1 with 
constant mean ^ = 0.1 and with a = 0.2. Similarly, for the MA models, we 
took the MA-parameters as /3i = 0.3, /32 = 0.5, and set a = 0.2 and /U = 0.1. 
For the third model, the mean of the Xj-variables were taken as a periodic 
function of time t: 

fit = fJ^ + cos 2TTt /p + sin 27rt /p 

with ^ = 1.0 and p € {5, 10, 20} and with a = 0.2. In all three cases, the et are 
generated from two distributions, namely, (i) A^(0, l)-distribution and (ii) a 
centered Exponential (1) distribution, to compare the effects of nonnormality 
on the performance of the two methods. Note that the rows of the generated 
X are identically distributed for models I and II but not for model III. We 
considered six combinations of where n denotes the sample size and 

p the number of time slots (or the periodicity). The parameter of interest 6 
was the population mean and the estimator 6^ was taken to be the sample 
mean. Thus, the row-wise estimators 6jn were the sample means of the 
row-variables and the weights in (3.4) were Wjn = 1/p for all j = 1, . . . ,p. 
In all, there are (3x2x6=) 36 possible combinations of (n,p)-pairs, the 
error distributions, and the three models. To keep the size of the paper 
to a reasonable length, we shall only present 3 combinations of (n,p) in 
the tables, while we present side- by-side box-plots for all 6 combinations 
of {n,p), arranged by the error distributions. All results are based on 500 
simulation runs. 

Figures 2 and 3 give the box-plots of the differences between the Gap 
Bootstrap I standard error estimates and the true standard errors in the 
one-dimensional case under centered exponential and under normal error 
distributions, respectively. Here box-plots in the top panels are based on the 
AR(2) model, the middle panels are based on the MA(2) model, while the 
bottom panels are based on the periodic model. For each model, the combi- 
nations of {n,p) are given by {n,p) = (200, 5), (500, 10), (1800, 30), (3500, 50), 
(6000, 75), (10,000, 100). 

Similarly, Figures 4 and 5 give the corresponding box-plots for the Gap 
Bootstrap II method under centered exponential and under normal error 
distributions, respectively. 

From the Figures 4 and 5, it is evident that the variability of the standard 
error estimates from the Gap Bootstrap I Method is higher under Models I 
and II than under Model III for both error distributions. However, the bias 
under Model III is persistently higher even for larger values of the sample 
size. This can be explained by noting that for Method I, the assumption 
of approximate exchangeability of the rows is violated under the periodic 
mean structure of Model III, leading to a bigger bias. In comparison. Gap 



GAP BOOTSTRAP FOR MASSIVE DATA SETS 



15 




(200, 5) (500, 10) (1800, 30) (3500, 50) (6000, 75) (10000, 100) 

(n, P) 

Fig. 2. Box-plots of the differences between the standard error estimates based on Gap 
Bootstrap I and the true standard errors in the one-dimensional case using 500 simulation 
runs. Here, plots in the first panel are based on Model I, those m the second and third panels 
are based on Models 11 and HI, respectively. The values of {n,p) for each box-plot are given 
at the bottom of the third panel. The innovation distribution is centered exponential. 

Bootstrap II estimates tend to center around the target value (i.e., with 
differences around zero) even for the periodic model. Table 1 gives the true 
values of the standard errors of On based on Monte-Carlo simulation and the 
corresponding summary measures for Gap Bootstrap methods I and II in 18 
out of the 36 cases [we report only the first 3 combinations of {n,p) to save 
space. A similar pattern was observed in the other 18 cases]. 
From the table, we make the following observations: 

(i) The biases of the Gap Bootstrap I estimators are consistently higher 
than those based on Method II under Models I and II for both normal 
and nonnormal errors, resulting in higher overall MSEs for Gap Bootstrap 

I estimators. 

(ii) Unlike under Models I and II, here the biases of the two methods 
can have opposite signs. 

(iii) From the last column of Table 1 (which gives the ratios of the MSEs 
of estimators based on Methods I and II), it follows that the Gap Bootstrap 

II works significantly better than Gap Bootstrap I for Models I and II. For 
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Fig. 3. Box-plots for the differences of Gap Bootstrap I estimates and the true standard 
errors as in Figure 2, but under normal innovation distribution. 



Model III, neither method dominates the other in terms of bias and/or MSE. 
MSE comparison shows a curious behavior of Method I at (n,p) = (500, 10) 
for the periodic model. 

(iv) The nonnormality of the 's does not seem to have significant effects 
on the relative accuracy of the two methods. 

Next we consider performance of the two gap Bootstrap methods for mul- 
tivariate data. The models we consider are analogs of (I)-(III) above, with 
the general structure 

Yt = (0.2,0.3,0.4,0.5)' + Zt, t > 1, 

where is taken to be the following: (IV) a multivariate autoregressive 
(MAR) process, (V) a multivariate moving average (MMA) process, and 
(VI) a multivariate periodic process. For the MAR process, 

Zt = 'I'Zt.i + et, 

where 
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(200, 5) 
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(3500, 50) 



(6000, 75) 



(10000, 100) 



(n, P) 



Fig. 4. Box-plots of the differences of standard error estimates based on Gap Bootstrap 
II and the true standard errors in the one-dimensional case, as in Figure 2, under the 
centered exponential innovation distribution. 



and the ej are i.i.d. d = 4 dimensional normal random vectors with mean 
and covariance matrix Sq, where we consider two choices of Sq: 

(i) So is the identity matrix of order 4; 

(ii) So has (i, j)th element given 

by (-/o)''"^', l<2,i<4, with p = 0.55. 

For the MMA model, we take 

Zt = $iet_i + ^2et-2 + et, 
where are as above. The matrix of MA coefficients are given by 
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where, in both $1 and ^2, the *'s are generated by using a random sample 
from the UNIFORM (0, 1) distribution [i.e., random numbers in (0, 1)] and 
are held fixed throughout the simulation. We take $1 and $2 as lower tri- 
angular matrices to mimic the structure of the OD model for the real data 
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Fig. 5. Box-plots of the differences of standard error estimates based on Gap Bootstrap 
II and the true standard errors in the one-dimensional case, as in Figure 2, under the 
normal innovation distribution. 



example that will be considered in Section 6 below. Finally, the observations 
Xt under the periodic model (VI) are generated by stacking the univariate 
case with the same p, but with \x changed to the the vector (0.2, 0.3, 0.4, 0.5). 
The component- wise values of a\ and are kept the same and the et's for 
the 4 components are now given by the e^'s, with the two choices of the 
covariance matrix. 

The parameter of interest is the mean of component-wise means, that is, 

6» = /2 = [0.2 + 0.3 + 0.4 + 0.5]/4. 

The estimator 0„ is the mean of the component- wise means of the entire data 
set and 0^"^^ is given by the mean of the component-wise means coming from 
the ith row of n/p-many data vectors, for j = 1, . . . ,p. Box-plots of the dif- 
ferences between the true standard errors of 0„ and their estimates obtained 
by the two Gap Bootstrap methods are reported in Figures 6 and 7, respec- 
tively. We only report the results for the models with covariance structure 
(ii) above (to save space). 

The number of simulation runs is 500 as in the univariate case. From the 
figures it follows that the relative patterns of the box-plots mimic those in the 
case of the univariate case, with Gap Bootstrap I leading to systematic biases 



GAP BOOTSTRAP FOR MASSIVE DATA SETS 



19 



Table 1 

Bias and MSEs of Standard Error estimates from Gap Bootstraps I and II for univariate 

data for Models I-III. For each model, the two sets of 3 rows correspond to 
(n,p) = (200,5), (500, 10), (1800,30) under the normal (denoted by N m the first column) 

and the centered Exponential (denoted by E) error distributions, respectively. Here 
B-I= Bias of Gap Bootstrap / xlO^, M-I= MSE of Gap Bootstrap / xlO'', B-II^ Bias 
of Gap Bootstrap II xlO'^ , and M-II— MSE of Gap Bootstrap II xlO**. Column 2 gives 
the target parameter evaluated by Monte-Garlo simulations and the last column is the 

ratio of columns 4 and 6 



Model 


True-se 


B-I 


M-I 


B-II 


M-II 


Ratio (fix) 


I.N.I 


0.013 


-0.831 


0.708 


-0.376 


0.029 


24.4 


I.N.2 


0.011 


-0.700 


0.503 


-0.118 


0.0202 


25.2 


I.N.3 


0.008 


-0.481 


0.241 


-0.256 


0.0142 


17.2 


I.E.I 


0.065 


-4.18 


17.8 


-1.97 


0.623 


28.6 


I.E.2 


0.053 


-3.54 


12.8 


-1.52 


0.451 


28.4 


I.E.3 


0.038 


-2.41 


6.04 


-0.844 


0.348 


17.4 


II.N.l 


0.005 


-0.240 


0.061 


-0.178 


0.008 


7.6 


II.N.2 


0.003 


-0.154 


0.026 


-0.122 


0.004 


6.5 


II.N.3 


0.002 


-0.081 


0.007 


-0.087 


0.001 


7.0 


lI.E.l 


0.023 


-1.22E 


1.59 


-1.18 


0.183 


8.9 


II.E.2 


0.015 


-0.767 


0.657 


-0.288 


0.101 


6.5 


II.E.3 


0.008 


-0.398 


0.184 


-0.092 


0.025 


7.4 


III.N.l 


0.003 


-0.125 


0.016 


-0.183 


0.005 


3.2 


III.N.2 


0.002 


-0.0263 


0.0008 


-0.065 


0.002 


0.4 


III.N.3 


0.001 


0.059 


0.004 


-0.028 


0.0004 


10.0 


III.E.l 


0.014 


-0.619 


0.386 


-0.549 


0.094 


4.1 


III.E.2 


0.009 


-0.158 


0.026 


-0.506 


0.042 


0.6 


III.E.3 


0.005 


0.292 


0.086 


-0.216 


0.010 


8.6 



under the periodic mean structure. For comparison, we have also considered 
the performance of more standard methods, namely, the overlapping versions 
of the Subsampling (SS) and the Block Bootstrap (BB). 

Figures 8 and 9 give box-plots of the differences between the true standard 
errors of On and their estimates obtained by SS and BB methods, under 
Models (IV)-(VI) with covariance structure (ii). The choice of the block size 
was based on the block length selection rule of Patton, Politis and White 
(2009). From the figures, it follows that the relative performances of the SS 
and the BB methods are qualitatively similar and both methods handily 
outperform Gap Bootstrap I. 

These qualitative observations are more precisely quantified in Table 2 
which gives the MSEs of all 4 methods for models (IV)-(VI) for all six com- 
binations of (n,p) under covariance structure (ii). It follows from the table 
that Gap Bootstrap Method II has the best overall performance in terms of 
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0.03 - 





(200, 5) (500, 10) (1800, 30) (3500, 50) (6000, 75) (10000, 100) 



(n, P) 



Fig. 6. Box-plots of the differences of standard error estimates based on Gap Bootstrap I 
and the true standard errors in the multivariate case, under the Type 11 error distribution. 
The number of simulation runs is 500. Also, the models and the values of {n,p) are depicted 
on the panels as m Figure 2. 



the MSE. This may appear somewhat counter-intuitive at first glance, but 
the gain in efficiency of Gap Bootstrap II can be explained by noting that 
it results from judicious choices of resampling methods for different parts 
of the target parameter, as explained in Section 3.3.3 (cf. Remark 3.1). On 
the other hand, in terms of computational time, Gap Bootstrap I had the 
best possible performance, followed by the SS, Gap Bootstrap II and the 
BB methods, respectively. Since the basic estimator is computationally 
very simple (being the sample mean), the computational time may exhibit 
a very different relative pattern (e.g., for On requiring high-dimensional ma- 
trix inversion, the BB method based on the entire data set may be totally 
infeasible). 

6. A real data example: The OD estimation problem. 

6.1. Data description. A 4.9 mile section of Interstate 10 (I-IO) in San 
Antonio, Texas was chosen as the test bed for this study. This section of free- 
way is monitored as part of San Antonio's TransGuide Traffic Management 
Center, an intelligent transportation systems application that provides mo- 
torists with advanced information regarding travel times, congestion, acci- 
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(200, 6) (500, 10) (1800, 30) (3S00, 50) (6000, 75) (10000, 100) 

(n. P) 



Fig. 7. Box-plots of the differences of standard error estimates based on Gap Bootstrap 
II and the true standard errors in the multivariate case, under the setup of Figure 6. 

dents and other traffic conditions. Archived hnk volume counts from a series 
of 14 inductive loop detector locations (2 main lane locations, 6 on-ramps 
and 6 off-ramps) were used in this study (see Figure 1). The analysis is based 
on 575 days of peak AM (6:30 to 9:30) traffic count data (All weekdays — 
January 1, 2007 to March 13, 2009). Each day's data were summarized into 
36 volume counts of 5-minute duration. Thus, there were a total of 20,700 
time points, and each time point giving 14 origin-destination traffic data, 
resulting in more than a quarter-million data-values. Figures 10 and 11 are 
plots showing the periodic behavior of the link volume count data at the 7 
origin (01 to 07) and 7 destination (Dl to D7) locations, respectively. 

6.2. A synthetic OD model. As described in Section 2, the OD trip ma- 
trix is required in many traffic applications such as traffic simulation mod- 
els, traffic management, transportation planning and economic development. 
However, due to the high cost of direct measurements, the OD entries are 
constructed using synthetic OD models [Cascetta (1984), Bell (1991), Oku- 
tani (1987), Dixon and Rilett (2000)]. One common approach for estimating 
the OD matrix from link volume counts is based on the least squares regres- 
sion where the unknown OD matrix is estimated by minimizing the squared 
Euclidean distance between the observed link volumes and the estimated 
link volumes. 
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{200, 5) (500, 10) (1800, 30) (3600, 50) (6000, 75) (10000, 100) 

(n, P) 



Fig. 8. Box-plots of the difference of standard error estimates based on Subsampling and 
the true standard errors in the multivariate case, under the setup of Figure 6. 

Given the link volume counts on all origin and destination ramps, the OD 
split proportion, pij (assumed homogeneous over the morning rush-hours), 
is the fraction of vehicles that exit the system at destination ramp djt given 
that they enter at origin ramp oa at time point t (cf. Section 2). Once the 
split proportions are known, the OD matrix for each time period can be 
identified as a linear combination of the split proportion matrix and the 
vector of origin volumes. It should be noted that because the origin volumes 
are dynamic, the estimated OD matrix is also dynamic. However, the split 
proportions are typically assumed constant so that the OD matrices by time 
slice are linear functions of each other [Gajewski et al. (2002)]. While this is 
a reasonable assumption for short freeway segments over a time span with 
homogeneous traffic patterns like the ones used in this study, it elicits the 
question as to when trips began and ended when used on larger networks over 
a longer tie span. It is also assumed that all vehicles that enter the system 
from each origin ramp during a given time period exit the system during 
the same time period. That is, it is assumed that conservation of vehicles 
holds, so that the sum of the trip proportions from each origin ramp equals 
1. Caution should be exercised in situations where a large proportion of trips 
begin and end during different time periods [Gajewski et al. (2002)]. Note 
also that some split proportions such as p2i are not feasible because of the 
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Fig. 9. Box-plots of the differences of standard error estimates based on the block boot- 
strap and the true standard errors m the multivariate case, under the setup of Figure 6. 



structure of the network. Moreover, all vehicles that enter the freeway from 
origin ramp 7 go through destination ramp 7 so that pn = 1. All of these 
constraints need to be incorporated into the estimation process. 

Let djt denote the volume at destination j over the tth time interval (of 
duration 5 minutes) and Ojt denote the jth origin volume over the same 
period. Let pij be the proportion of origin i volume contributing to the 
destination j volume (assumed not to change over time). Then, the synthetic 
OD model for the link volume counts can be described as follows: 

For each t, 

dit = oitpii + 

d2t = OltPl2 + 02tP22 + £2t, 
dst = OltPlS + 02tP23 + 03tP33 + £3t, 
(6.1) d4t = OitPu + 02tP2A + 03tP3A + O^tPAA: + ^At, 

du = OltPlS + 02iP25 + OSiPSS + 04tP45 + OStPSS + ^54, 
dm = OitPlQ + 02iP26 + OSiPSe + 04tP46 + 05tP56 + o^Pm + £6i, 
dlt = OuPn + 02tP27 + 03tP37 + OAtPAT + 05tP57 + OQtP67 
+ 07tP77 + e7t, 
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Table 2 

MSEs of Standard Error estimates from Gap Bootstraps I and II and the Subsampling 
(SS) and Block Bootstrap (BB) methods for the multivariate data for Models IV-VI 
under covariance matrix of type (ii) . The six rows under each model correspond to 
(n,p) = (200,5), (500,10), (1800,30), (3500,50), (6000, 75), (10,000, 100). Further, the 
entries in the table gives the values of the MSEs multiplied 10*, 10* and W"' for 
Models IV-VI, respectively 



Model 


True-se 


GB-I 


GB-II 


SS 


BB 


IV. 1 


0.044 


6.190 


0.634 


1.390 


1.510 


IV.2 


0.030 


2.970 


0.353 


0.568 


0.567 


IV.3 


0.017 


0.873 


0.116 


0.151 


0.162 


IV.4 


0.012 


0.451 


0.064 


0.078 


0.082 


IV.5 


0.009 


0.247 


0.034 


0.040 


0.042 


IV.6 


0.007 


0.155 


0.017 


0.020 


0.020 


V.l 


0.076 


14.300 


2.350 


3.690 


4.040 


V.2 


0.053 


7.560 


1.190 


1.650 


1.690 


V.3 


0.028 


2.060 


0.300 


0.374 


0.427 


V.4 


0.019 


0.930 


0.144 


0.165 


0.176 


V.5 


0.015 


0.590 


0.080 


0.094 


0.099 


V.6 


0.011 


0.297 


0.037 


0.043 


0.045 


VI.l 


0.022 


10.300 


2.400 


3.150 


3.440 


VI.2 


0.014 


4.250 


0.918 


1.110 


1.100 


VI.3 


0.007 


2.230 


0.215 


0.257 


0.291 


VI.4 


0.005 


3.860 


0.111 


0.134 


0.140 


VI.5 


0.004 


4.620 


0.069 


0.073 


0.074 


VI.6 


0.003 


4.350 


0.032 


0.036 


0.038 



where £jt are (correlated) error variables. Note that the parameters pij sat- 
isfy the conditions 

7 

(6.2) X1kj = 1 for i = l,...,7. 

j=i 

In particular, pj^ = 1. Because of the above linear restrictions on the Pij^s, it 
is enough to estimate the parameter vector p = (pii,pi2, • • • ,Pi6',P22, ■ ■ ■ ,P26', 
■ ■ ■ ;P66)'- We relabel the components and write p = {9^^\ . . . , 9^"^^^)' = 6. We 
will estimate these parameters by the least squares method using the entire 
data, resulting in the estimator 9n and using the daily data over each of 
the 36 time intervals of length 5 minutes, yielding 6jn, j = 1,...,24. For 
notational simplicity, we set 9on = 9n ■ 

For t = 1, . . . , 20,700, let A = (d^, . . . , d^^d-jt - ELi ^u)' and let Ot be 
the 7 X 21 matrix given by 



o,=[or^...:on, 
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Fig. 10. Plots of the origin volume counts for the San Antonio, TX data (including 
weekend days). 



26 LAHIRI, SPIEGELMAN, APPIAH AND RILETT 

™ 1 DtiMnaHon 1 

nQ'niBr^ioiA'Wmr4^E3mar^uinvnrviriOniDnukn*rn<NiHag^Dr<-uviq<<iNiHOn 



D •- 



S'MlnuTB Interval 

Destination 3 



S50 
S 100 




■HiH'iH'pHiHnivi%ir4|^min4VimMtf4'<itq'44ini/binini/>U]iau>(aiar^r<-rvr<-r^r^ 

E-Minuta bittrvii 




i«u>iBi:3NwuinO'Hmi#ir~D>--<mirir<-in£]r4«iaBiD<N«uWHiiiiHin4^ 
5-MI[3uta litttrvH 

Destination G 



■-^^^■^T*T^MrlM^vr^f^^w^fllmmw»»^'W»«rf1lrtl^llln^ou>lB^I^^e^•■r-^^»^^^tlllI^mrtmm[7lm 



Fig. 11. PZots o/ t/ie destination volume counts for the San Antonio, TX data (including 
weekend days). 
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where, for A; = 1, . . . , 6, O^^^ is a 7 x (7 — A;) matrix with its last row given by 
{—Okt, • • • , —Okt) and the rest of the elements by 

{of% = oj,tl{i>l 
For j = 0,l,...,36, let 



(of 1),^. = out{i >k)l{j = i-k + l), i = 1, . . . , 6, j = 1, . . . , 7 - fc. 



(6.3) 



where Tj = + 36, . . . , j + (574 x 36)} for j = 1, . . . , 36 and where Tq = 
{1,...,720}. Note that each of Ti,...,r36 has size 575 (the total number 
of days) and corresponds to the counts data over the respective 5 minute 
period, while Tq has size 20,700 and it corresponds to the entire data set. 
For applying Gap Bootstrap II, we need a minor extension of the formulas 
given in Section 3.3, as the weights in (3.4) now vary component-wise. For 
j = 0, 1, . . . , 36, define Tjn = J2teT ^'tOt- Then, the following version of (3.4) 
holds [without the Op(l) term]: 

36 

On = ^ WjnOjn, 

i=i 

where Wjn = FQ^^Pj^. This can be proved by noting that 

36 36 
Gn = Fq-^ Y1 O'tDt = Fo^l Yl E O'tDt = ^^r^e^n. 
t£To 3=1 t&Tj j=l 

The Gap Bootstrap II estimator of the variance of the individual components 
6n^ , . . . , 6n^^ of the estimator On is now given by 

36 36 

^(4"^) = Y.Y. ^ak(TalPa{k, I), a = 1, . . . , 21, 

k=l 1=1 

where (5"^^ = w^^S^'^'Wafc, S'^'^^ is the i.i.d. bootstrap based estimator of the 
variance matrix of pa{k,j) is the sampling window estimator of the 
correlation between the ath component of the kth and jth row-wise es- 
timators of 9 and wife's are weights based on WjnS. Indeed, with ei = 
(1,0,..., 0)', . . . , 621 = (0, . . . , 1)', we have w^^ = e'^T^^Tjn, l<j< 36. To 
find pa(A:,j)'s, we applied the sampling window method estimator with 
i=17 and the following formula for pa{k,j): 

. ,^ . _ Eli(w;,[C - gn])(w^[g;2 - 9n]) 
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Table 3 

Standard Error estimates from Gap Bootstraps I and II (denoted by STD-I 
and STD-II, resp.) for the San Antonio, TX data 



Pij 


Estimates 


STD-I 


STD-II 


Pij 


Estimates 


STD-I 


STD-II 


Pll 


0.355 


0.0009 


0.0019 


P33 


0.046 


0.0026 


0.0041 


Pl2 


0.104 


0.0018 


0.0042 


P34 


0.232 


0.0032 


0.0132 


Pl3 


0.011 


0.0006 


0.0015 


P35 


0.106 


0.0061 


0.0082 


Pl4 


0.064 


0.0043 


0.0131 


P36 


0.039 


0.0025 


0.0080 


Pl5 


0.047 


0.0024 


0.0073 


P44 


0.436 


0.0100 


0.0155 


Pie, 


0.022 


0.0017 


0.0042 


P45 


0.240 


0.0123 


0.0094 


P22 


0.385 


0.0079 


0.0118 


P46 


0.105 


0.0057 


0.0141 


P23 


0.083 


0.0044 


0.0066 


P55 


0.233 


0.0080 


0.0130 


P24 


0.242 


0.0053 


0.0237 


P56 


0.109 


0.0045 


0.0168 


P25 


0.112 


0.0107 


0.0144 


P66 


0.537 


0.0093 


0.0263 


P26 


0.064 


0.0037 


0.0058 











j,k = 1, ... ,36, a = 1, ... ,21, where is the ith subsample version of 9kn 
and / = 575 — i + 1 = 559. Following the result on the optimal order of the 
block size for estimation of (co)-variances in the block resampling literature 
[cf. Lahiri (2003)], here we have set i = cN^/^ with N = 575 and c = 2. 

Table 3 gives the estimated standard errors of the least squares estimators 
of the 21 parameters 6i, ... ,621. 

From the table, it is evident that the estimates generated by Gap Boot- 
strap I are consistently smaller than those produced by Gap Bootstrap II. 
To verify the presence of serial correlation within columns, we also computed 
the component-wise sample autocorrelation functions (ACFs) for each of ori- 
gin and destination time series (not shown here). From these, we found that 
there is nontrivial correlation in all other series up to lag 14 and that the 
ACFs are of different shapes. In view of the nonstationarity of the compo- 
nents and the presence of nontrivial serial correlation, it seems reasonable to 
infer that Gap Bootstrap I underestimates the standard error of the split pro- 
portion estimates in the synthetic OD model and, hence. Gap Bootstrap II 
estimates may be used for further analysis and decision making. 

7. Concluding remarks. In this paper we have presented two resampling 
methods that are suitable for carrying out inference on a class of massive 
data sets that have a special structural property. While naive applications of 
the existing resampling methodology are severely constrained by the com- 
putational issues associated with massive data sets, the proposed methods 
exploit the so-called "gap" structure of massive data sets to split them into 
well-behaved smaller subsets where judicious combinations of known resam- 
pling techniques can be employed to obtain subset-wise accurate solutions. 
Some simple analytical considerations are then used to combine the piece- 
wise results to solve the original problem that is otherwise intractable. As 
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is evident from the discussions earher, the versions of the proposed Gap 
Bootstrap methods require different sets of regularity conditions for their 
vahdity. Method I requires that the different subsets (in our notation, rows) 
have approximately the same distribution and that the number of such sub- 
sets be large. In comparison. Method II allows for nonstationarity among 
the different subsets and does not require the number of subsets itself to go 
to infinity. However, the price paid for a wider range of validity for Method 
II is that it requires some analytical considerations [cf. (3.4)] and that it uses 
more complex resampling methodology. We show that the analytical consid- 
erations are often simple, specifically for asymptotically linear estimators, 
which cover a number of commonly used classes of estimators. Even in the 
nonstationary setup, such as in the regression models associated with the 
real data example, finding the weights in (3.4) is not very difficult. In the 
moderate scale simulation of Section 5, Method II typically outperformed 
all the resampling methods considered here, including, perhaps surprisingly, 
the block bootstrap on the entire data set; This can be explained by noting 
that unlike the block bootstrap method. Method II crucially exploits the gap 
structure to estimate different parts by using a suitable resampling method 
for each part separately. On the other hand, Method I gives a "quick and 
simple" alternative for massive data sets that has a reasonably good per- 
formance whenever the data subsets are relatively homogeneous and the 
number of subsets is large. 

APPENDIX: PROOFS 

For clarity of exposition, we first give a relatively detailed proof of The- 
orem 4.2 in Section A.l and then outline a proof of Theorem 4.1 in Sec- 
tion A. 2. 

A.l. Proof of consistency of Method II. 

A. 1.1. Conditions. Let {'Yt}t£Z be a d-dimensional time series on a 
probability space {0,,J-,P) with strong mixing coefficient 

a(n) = snp{\P{A n B) - P{A)P{B)\ : ^ G J"^, S G a G Z}, n > 1, 

where Z = {0, ±1, ±2, . . .} and where = a{Yt : t G [a, 6] n Z) for -oo < 
a < 6 < oo. We suppose that the observations {X^ : t = 1, . . . , n} are obtained 
from the Y^-series with systematic deletion of Y^-subseries of length q, as 
described in Section 2.2, leaving a gap of q in between two columns of X, 
that is, (Xi, . . . ,Xp) = (Yi, . . . , Yp), (Xp+i, . . . ,X2p = (Yp+g+i, . . . , Y2p+g), 
etc. Thus, for i = 0, . . . , m — 1 and j = 1, . . . 

Xjp+j = Yj(p_|_^)_|_j. 

Further, suppose that the vectorized process {(Xjp+i, . . . , X(j^i)p) : i > 0} is 
stationary. Thus, the original process {Y^} is nonstationary, but it has a 
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periodic structure over a suitable subset of the index set, as is the case in 
the transportation data example. Note that these assumptions are somewhat 
weaker than the requirements in (2.2). Also, for each j = l,...,p, denote 
the i.i.d. bootstrap observations generated by Efron (1979) 's bootstrap by 
{'X.*p^j : i = 0, . . . , m — 1} and the bootstrap version of Ojn by 6*^. Write £"* 
and Var* to denote the conditional expectation and variance of the bootstrap 
variables. 

To prove the consistency of the Gap bootstrap II variance estimator, we 
will make use of the following conditions: 

(C.l) There exist C G (0, oo) and 6 G (0, oo) such that fov j = 1, . . . ,p, 
and Yl'7i=i 

a(n)V(2+<5) 

< oo. 

(C.2) [On - E%iWjJjn] = o{n~^l^) in L\P). 
(C.3) (i) For j = l,...,p, 

m—1 

^jn = m-' Yl V'i(Xip+j) + o(m-V2) in l\P). 
i=0 

(ii) For j = l,...,p, 

m—1 

i+e-i 

^ln= E V'j(X(,„i)p+,)+o(ri/2) mL\P),i = l,...,I. 

a=i 

(C.4) g ^ OO and p^^^j^ = 0(1) as n ^ OO. 

We now briefly comment on the conditions. Condition (C.l) is a stan- 
dard moment and mixing condition used in the literature for convergence of 
the series X^^^i Cov('0j(Xj), ■0j(Xfcp_|_j)) [cf. Ibragimov and Linnik (1971)]. 
Condition (C.2) is a stronger form of (3.4). It guarantees asymptotic equiv- 
alence of the variances of 9n and its subsample (row)-based approximation 
Yl^j=i'^3ndjn- Condition (C.3) in turn allows us to obtain an explicit ex- 
pression for the asymptotic variance of Ojn and, hence, of On- Note that the 
linear representation of Ojn in (C.3) holds for many common estimators, in- 
cluding M, L and R estimators, where the Lp'{P) convergence is replaced by 
convergence in probability. The L{P) convergence holds for M-estimators 
under suitable monotonicity conditions on the score function; for L and R- 
estimators, it also holds under suitable moment condition on Xj's and under 
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suitable growth conditions on the weight functions. Condition (C.3)(ii) re- 
quires that a Hnear representation similar to that of the row-wise estimator 
9jn holds for its i.i.d. bootstrap version 6**^. If the bootstrap variables ^ip-^-j 
are defined on (0,7-", P) (which can always be done on a possibly enlarged 
probability space), then the iterated expectation E[E^{R*^}'^] is the same 

as the unconditional expectation and the first part of (C.2)(ii) can 

be simply stated as 

m— 1 

^In = E ^^(^*P+j) + o{m-^'^) in L\P). 
1=0 

The second part of (C.2)(ii) is an analog of (C.2)(i) for the subsample ver- 
sions of the estimators 9jnS. The remainder term here is o{e-^/^), as the 
subsampling estimators are now based on i columns of X^-variables as op- 
posed to m columns for 9jnS. All the representations in condition (C.3) hold 
for suitable classes of M, L and R estimators, as described above. 

Next consider condition (C.4). It requires that the gap between the Yt 
variables in two consecutive columns of X go to infinity, at an arbitrary 
rate. This condition guarantees that the i.i.d. bootstrap of Efron (1979) 
yields consistent variance estimators for the row-wise estimators 9jnS, even 
in presence of (weak) serial correlation. The second part of condition (C.4) 
is equivalent to requiring wjn — 0(1) for each j — 1, . . . ,p, when p is fixed. 
For simplicity, in the following we only prove Theorem 4.2 for the case p 
is fixed. However, in some applications, "p — ?> oo" may be a more realistic 
assumption and, in this case. Theorem 4.2 remains valid provided the order 
symbols in (C.3) have the rate o(m~^/^) uniformly over j G {!,..., p}, in 
addition to the other conditions. 

A.l. 2. Proofs. Let 91 = Yfj=iWjn9jn and 9^^ = rn-'^YZ^o^ "^jO^ip+j)' 
j = 1,. . . ,p. Let K denote a generic constant in (0, oo) that does not depend 
on n. Also, unless otherwise specified, limits in order symbols are taken by 
letting n — >■ cx). 



(A.l 



Proof of Theorem 4.2. First we show that 

p p 

Yari9n)-Y.Y1 

WjnWkn Cov{9jn, 9kn) 

j=l k=l 

3t 



n 



0(1). 



Let An = 9n- 9l. Note that by condition (C.2), EA^ = o(l). Hence, by the 
Cauchy-Schwarz inequality, the left side of (A.l) equals 

n\E{en-E9n)^-E{9i-E9if\ 

< 2n\E{9i - E9i){An - EAn)\ + n Var(A„) 
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<2n[Var(0t)]i/2(^^2)i/2^^^2 

= o{l), 

provided Var(4) = 0(l). 

To see that Var (6'n) = 0(1), note that 

m—l 

(A.2) = E^l^j {Xjf + 2m-i ^ (m - A:)^;^^ (X.OV'j i^kp+j) 

k=l 

as, by conditions (C.l) and (C.4), 

m— 1 

2m"i ^ (m - A;)|^Vi(Yj)Vi(Yfe(p+,)+,-)l 

m—l 



k=l 



<C'2/(2+5)^ ^ a(it)^/(2+^)=o(l). 
fc=p+IJ 

By similar arguments, for any 1 <j,k <p, 

(A.3) mCov(e]„,eij =i?^,(X,)^fc(Xfc) + 0(1). 

Also, by (A.2) and conditions (C.3) and (C.4), 

p 

nYai:{el) = Var(^j.„) + 2n ^ |wjn';«fcn||Cov(^j„, 4n)| 





p 








=0(1) 









Hence, (A.l) follows. 

To complete the proof of the theorem, by (A.l), it now remains to show 
that 

(A.4) m[a|„-Var(^j„)] = Op(l), 

(A.5) pn{j, k) - pn{j, k) = Op(l) 

for all 1 < j,k < p, where pn{j,k) is the correlation between 6jn and O/^n- 
First consider (A.4). Note that by (A.2), mVar(^j„) = Eil^j{Xjf + o{l) and 
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ma]^ = mYaT^ ^m~^ ^ + Op(l). 

By using a truncation argument and the mixing condition (C.4), it is easy 
to show that 

m— 1 

"i""^ J2 [V'i(Xip+i)]'' = E[i;j(yiip+j)Y' + Op(l), r = 1, 2. 

i=0 

Hence, (A. 4) follows. Next, to prove (A. 5), note that by condition (C.3), 
(A.2) and (A.3), 

for all j,k. Also, by conditions (C.3)-(C.4) and standard variance bound 
under the moment and mixing conditions of (C.4), for all j, k, 

i=l i=l 

where ^j^*^ = J^atfi""^ '4'j{'^{a~i)p+j)^ i = 1, . . . , I . The consistency of the sam- 
pling window estimator of pn{j,k) can now be proved by using conditions 
(C.2), (C.3) and standard results [cf. Theorem 3.1, Lahiri (2003)]. This com- 
pletes the proof of (A. 5) and hence of Theorem 4.2. 

Proofs of (3.9) and (3.10). For notational simplicity, w.l.g., we set 
fj, = 0. (Otherwise, replace Xt hy Xt — fj, for all t in the following steps.) 
Write Xjn and X^'^^ respectively, for the sample averages of the jth row 
and kth column, 1 < j <p and 1 < k <m. First consider (3.9). Since /i = 0, 
it follows that for each j ^ {1, . . . ,p}, 

m m 
i=l i=l 

Since n = mp, using a similar argument, it follows that 9n = n^^ J2i^=i -^f + 
Op(n-i) = p-i ^jn + Op{n~^). Hence, (3.9) holds. 

Next consider (3.10). It is easy to check that for all k = 1, . . . ,m, 6^^^ = 
P~^ELi^ik-i)p+i - l^'-''^? and E[X(>'^]^=p~H'J:1. Hence, with Wk = 

n 

L=n-^^X! + OJn-^) 



i=l 
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m 
k=l 

m m 

= ^ 9^^^ + p-^l'Sl + m-^ ^Wk + Op{n-^) 

k=l k=l 

m 

k=l 

provided condition (C.l) holds with tpjix) = x'^ for ah j. Further, note 
that the leading part of the Oj,(n~^/^)-term is n"^^"^ x Jn-"^^^ X^^Li ^fc 
777,-1/2 X "^^^i Wk is asymptotically normal with mean zero and variance 
= Var(T^i) + 2 Cov(VFi, VFi+i). As a result, the Op(n-i/2)-term 
cannot be of a smaller order (except in the special case of = 0). □ 

A.2. Proof of consistency of Method I. 

A.2.1. Conditions. We shall continue to use the notation and conven- 
tions of Section A. 1.2. In addition to assuming that X satisfies (2.2), we 
shall make use of the following conditions: 

(A.l) (i) Pairwise distributions of {m^/'^{6jn — 0):l<j<p} are identical. 

(ii) {m}/'^{6jn — 0) -.1 < j <p\ are mo-dependent with mo = o{p). 
(A. 2) (i) ?nVar(0i„) — )• S and m Cov(0i.„, 02n) — >■ A as n — > cx). 

(ii) {[m}-/'^{6in — 0)]^ : n > 1} is uniformly integrable. 

(iii) mp~^ Si=i ^^^'(^jn) S as n — > oo. 

Now we briefly comment on the conditions. As indicated earlier, for the 
validity of the Gap Bootstrap I method, we do not need the exchangeability 
of the rows of X; the amount of homogeneity of the centered and scaled 
row-wise estimators {m^/^(^j„ — ^) : 1 < J < p}-, as specified by condition 
(A.l)(i), is all that is needed. (A.l)(i) also provides the motivation behind 
the definition of the variance estimator of the pair-wise differences right 
above (3.3). Condition (A.l)(ii) has two implications. First, it quantifies the 
approximate independence condition in (2.2). A suitable strong mixing con- 
dition can be used instead, as in the proof of Theorem 4.2, but we do not 
attempt such generalizations to keep the proof short. A second implication 
of (A.l)(ii) is that p — t- oo as n — >■ cxd, that is, the number of subsample esti- 
mators OjnS must be large. In comparison, mo may or may not go to infinity 
with n — >■ oo. Next consider condition (A. 2). Condition (A.2)(i) says that the 
row- wise estimators are root-m consistent and that for any pair j ^ k, the 
covariance between 711^^'^ (^Ojn — 0^ cind. tTi^ ^ (^0 i^Yi — ^) hcis sl comnion limit, 
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which is what we are indirectly trying to estimate using mYai{6jn — Okn)- 
Condition (A.2)(ii) is a uniform integrabihty condition that is imphed by 
E\m}'^{ein - ^2n)P+^ = 0(1) [cf. condition (C.l)] for some 5 > 0. Part (iii) 
of condition (A. 2) says that the i.i.d. bootstrap variance estimator apphed 
to the (average of the) row-wise estimators be consistent. A proof of this can 
be easily constructed using the arguments given in the proof of Theorem 4.2, 
by requiring some standard regularity conditions on the score functions that 
define the OjnS in Section 3.2. We decided to state it as a high level condition 
to avoid repetition of similar arguments and to save space. 

A. 2. 2. Proof of Theorem 4^.1. In view of condition (A. 2) (iii) and (3.3), 
it is enough to show that 

m[Var(^l„ - ^2n) - Eihn - 02n)i9ln - 92ny] 0. 

Since this is equivalent to showing component-wise consistency, without loss 
of generality, we may suppose that the 9jnS are one-dimensional. 

Define Vjk = miOjn - 9knf'l{\m^/'^{6jn - Okn)\ > an), Wjk = m{9jn - 
^fcn)^l(|"^^''^(^jn — ^fcn.)| ^ ^n) , for some a„ € (0,00) to be specified later. 
It is now enough to show that 

Qln = ^ \Vjk - EVjkl -^p 0> 



Q2n = P 



-2 



^ [W^k-EW^k] 

i<j^k<p 

By condition (A.2)(ii), {[m^/^(ein - 92n)? -n > 1} is also uniformly inte- 
grable and, hence, 

EQm < 2E\m^/\ein - 92nfl{\7n^/\9in - ^2n)| > an) = 0(1) 

whenever a„ — )■ oo as n — )• oo. Next consider Q2n- Define the sets Ji = 
{{j, k):l<j ^k< p}, Aj^k = {{ji,h) € Ji : min{|j - \k-ki\}< mo} and 
Ej,k = Ji\ ^j,k, (j, k) £ Ji. Then, for any {j, k) G Ji, by the mo-dependence 
condition, 

Cov{Wjk,Wa,b) = for all (a, 6) € S^, fc. 

Further, note that = the size of Aj^i^. is at most 2mQp for all (j. A;) E Ji. 

Hence, it follows that 



EQln<P' 



■(j,fc)6 Ji (j,fc)eJi {a,b)^{j,k) 

P^EW^2+ E E \C0Y{Wjk,Wab) 
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< p-^[p^alE\Wi2\+p^ ■ 2moP • alE\Wi2\] 
= O{p~^7noal) 

as £'|VFi2| < mE{6in - 62nf = 0{l). Now choosing a„ = [p/mo]"^/^ (say), 
we get Qkn -^p for A; = 1,2, proving (A. 6). This completes the proof of 
Theorem 4.1. □ 
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